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Abstract 

Modern GPUs are able to perform significantly more arithmetic opera- 
tions than transfers of a single word to or from global memory. Hence, many 
GPU kernels are limited by memory bandwidth and cannot exploit the arith- 
metic power of GPUs. However, the memory locality can be often improved 
by kernel fusion when a sequence of kernels is executed and some kernels in 
this sequence share data. 

In this paper, we show how kernels performing map, reduce or their nested 
combinations can be fused automatically by our source-to-source compiler. 
To demonstrate the usability of the compiler, we have implemented several 
BLAS-1 and BLAS-2 routines and show how the performance of their se- 
quences can be improved by fusions. Compared to similar sequences using 
CUBLAS, our compiler is able to generate code that is up to 2.61 x faster 
for the examples tested. 

Keywords: GPU, CUDA, BLAS, Kernel fusion, Code generation, 
Automated tuning 



1. Introduction and Motivation 

Today's accelerators, such as CUDA GPUs, are able to perform tens of 
arithmetic operations in the time that it takes for a word to be read from or 
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written to global memory. Moreover, the dominance of arithmetic power over 
memory bandwidth grows with each new hardware generation^ The input 
and output of each GPU kernel (i. e. the subprogram executed on GPU) 
has to be stored in the global memory. Thus, many kernels with low flop- 
to-word ratio are memory-bound. When such kernels executed in sequence 
share some data, performance may be improved by placing the shared data 
in some significantly faster on-chip memory. Although global memory is 
cached in new GPUs, caches usually cannot hold whole output of the kernel. 
However, the memory locality can be improved by fusing these kernels into 
a larger ones and placing shared data into on-chip memory explicitly. 

The number of possible fusions is high as each fusion is created accord- 
ing to sequence of kernel calls and data dependency between them. Thus, 
re-usability of fused kernels is limited. Because of this, it is impractical to 
produce libraries consisting of already-fused kernels. Instead, it is more prac- 
tical to use the library of simple and re-usable kernels and automatically 
generate fusions when the sequence of kernel calls is given. 

It is difficult to fuse generic kernels automatically, but automation of 
fusion becomes possible when the type of operations performed by kernels is 
limited. In this paper, we study automatic fusions of kernels performing map, 
reduce or their nested combination. In our approach, the function applied by 
map or reduce can run in multiple threads. Thus, it can efficiently process 
larger amount of data, which allows common optimization of memory locality, 
such as tiling. 

In this paper, we present kernel fusion as an optimization method and 
show how it can be automated by our source-to-source compiler when the 
type of fused kernels is restricted to map and reduce. The compiler works 
with a library of elementary functions and a script calling functions from the 
library. It fuses selected functions to improve their performance and preserve 
the semantics defined by the script. We note that fusing all kernels cannot 
always maximize performance. Thus, the compiler searches and prunes the 
optimization space to find efficient fusions. 

We address two main use cases by our approach. 

• Using fusion- equipped libraries. Some general purpose library can be 
implemented to be usable with our compiler. In that case, library 
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users can write only script calling library functions without the need 
to care about their GPU-specific implementation. The advantage of 
this approach is that library functions are automatically fused by our 
compiler, improving their performance. 

• Simplification of fusion optimization. In some cases, it is meaningful to 
develop both the script and the library (even if is not widely reusable) 
and use our compiler to find efficient fusions. First, many combinations 
of library function calls may be needed which makes the manual fusion 
time demanding and error-prone. Second, the optimization of code-to- 
kernels distribution may be hard (one such example is presented in our 
previous paper [1]). 

To demonstrate the performance benefit of kernel fusions generated by 
our compiler, we have accelerated several sequences of BLAS (Basic Linear 
Algebra Subprograms) routine calls. BLAS is a library of linear algebra 
routines, which is frequently used in scientific computation and is believed to 
be well-optimized. The BLAS-1 (vector- vector) and BLAS-2 (matrix- vector) 
routines are memory-bound, thus their sequences are good candidates to be 
improved by fusions [21 [3] . We show that fusing several BLAS routines into a 
single kernel can significantly improve performance by reducing the number 
of memory transfers. This performance improvement cannot be achieved by 
tuning unfused kernels separately. 

The rest of the paper is structured as follows. The overview of work 
related to our research is given in Section [2j The general discussion about 
performance impact of kernel fusion as well as its automation can be found 
in Section [3j whereas Section [4] describes the compiler allowing automatic 
fusions. The performance of a code generated by our compiler is evaluated 
in Section [5j The Section [6] concludes the paper and sketch the future work. 

2. Related Work 

The code-to-kernel distribution can be optimized by kernel fusion, or by 
generation of kernels of optimized size from a code which is not explicitly di- 
vided into kernels (monolithic implementation or some high-level language). 

The kernel fusion is allowed in some frameworks working with algorith- 
mic skeletons. Algorithmic skeletons that allow automatic parallelization are 
predefined higher-order functions performing given user-defined first-order 
functions [H [5] . The SkeTo framework automatically fuse skeletons to spare 
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global memory transfers [5J. The fusion is possible also in Thrust [7], but 
the programmer has to explicitly set the kernels to be fused. The significant 
difference of our approach is that first-order functions can be parallel, which 
allows them to process larger data (e.g. small tensors [lj or matrix tiles), 
whereas user-defined functions executed by skeletons are serial. Second dif- 
ference is that we search fusion optimization space to discard suboptimal 
fusions. 

In array programming, one defines the transformations of whole arrays 
using element-wise operations, reduction etc. [8] Although array and skeletal 
programming introduce different programming models, the transformations 
of arrays performs usually similar operations as skeletons and there is simi- 
lar opportunity to perform several transformations within single kernel. The 
Barracuda compiler is able to fuse arrays and perform operations on these 
arrays in a single kernel [9]. The fusions are performed whenever it is pos- 
sible, without considering on-chip resources consumption. A similar fusion 
mechanism is implemented in Copperhead [TU], which is a high-level data- 
parallel language embedded in Python. It seems that both Barracuda and 
Copperhead do not discard suboptimal fusions. A programmer cannot write 
the native code of the transformation applied to array's elements (i. e. first- 
order functions), thus he or she cannot explicitly define parallel per-element 
code or implement any low-level optimization, which is possible in our ap- 
proach. On the other hand, our approach is more low-level, as our compiler 
fuses functions written in C for CUD A. 

A tool by Gulati and Khatri [TJ] automatizes the partitioning of the input 
code into kernels and automatically generates the code of output kernels. The 
input code performs serial computation, the output code performs the com- 
putation multiple times in parallel (thus it is application of map function). 
The paper shows that the optimization of resource usage by partitioning of 
the code into several kernels may in some cases improve the performance over 
monolithic implementation even if the data locality is worse in the partitioned 
code, which agree with our results. 

A fusion method improving energy efficiency of CUDA kernels is pro- 
posed in [12]. This method does not aim at improving the execution times 
of kernels, as kernels are fused without improving data locality. Instead, it 
aims at balance the demand for hardware resources, resulting in lower power 
consumption, but similar execution times. Similarly to our method, the 
previously-implemented kernels are fused instead of automatic paralleliza- 
tion. 



4 



The idea of optimizing sequences of BLAS functions by their fusion is 
not new, however, to the best of author's knowledge, no system allowing 
fusions on GPUs has been published. Belter at al. [2] introduce a BTO BLAS 
compiler, which is able to fuse BLAS functions targeting modern CPUs. The 
DESOLA active library, presented by Russell at al. [T3], performs fusions in 
time the BLAS functions are called, i. e. without a previous compilation. The 
main difference between our research and those presented in j2] and p3] is 
that we target GPU architecture, thus the technique of the fusion significantly 
differs. We fuse parallel kernels instead of loops, which requires different 
techniques to perform the fusion correctly.Moreover, the optimization space 
search and performance prediction also changes due to different nature of 
GPUs. Our approach addresses multiple types of computational problems, 
whereas BTO BLAS and DESOLA focus only on BLAS. Our approach uses 
the hand-tuned routines, whereas BTO BLAS uses high-level description 
of BLAS routines and DESOLA implements initial BLAS functions (which 
are further optimized automatically) in language similar to C, without any 
architecture-specific optimizations. 

In our previous papers, we have introduced basic principles of our com- 
piler [H] and its non-trivial application together with improved efficiency of 
data exchanging between fused kernels pQ. Nevertheless, the compiler in- 
troduced in these papers was restricted on map kernels, which significantly 
limits its applicability. In this paper, we discuss fusion of nested map and 
reduce and show the structure of generated code and process of its generation 
in deeper detail. 

3. Kernel Fusion 

In this section, we discuss kernel fusion in more detail, but still as a general 
concept, i. e. independently of the design and implementation decisions made 
for our compiler. First, the performance advantages and disadvantages of 
fusions are discussed. Second, the properties of map and reduce functions 
allowing their automatic fusion are described. Finally, the implementation 
of BLAS routines as fusible kernels is introduced. 

3.1. Fusion as an Optimization Method 

The main advantage of fusion is the improvement of memory locality. In 
CUDA, each kernel has to read its input from and store its output to off-chip 
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global memory. When two kernels share some data, their fusion can hold 
shared data in on-chip memory — registers or shared memory. 

Consider the example depicted in Figure [I] left, where z = fs(fi{x), fi{y)) 
is evaluated. When /i,/2 and / 3 are fused into a single kernel, the results 
computed by fx and f 2 can be held in on-chip memory and immediately 
used by / 3 . Otherwise, the outputs of j\ and f 2 have to be stored in global 
memory and loaded by / 3 . If the performance of fi, f 2 or / 3 is bounded 
by global memory bandwidth, the fusion increases performance by reducing 
global memory data transfers. 

An additional benefit of kernel fusion is the reduction of kernel launch 
overhead (a lower number of kernels are launched). Moreover, the fused 
kernels are more complex, thus the optimizing compiler has more room for 
optimizing the instructions, such as common subexpression elimination (e. g. 
data indexing can be the same or similar for multiple functions included in 
single fusion), loop fusion, etc. 

Besides the performance improvements mentioned above, fusion may also 
decrease performance. The occupancy of the GPU (the number of warps that 
can concurrently run on the GPU) must be sufficient to hide the memory 
latency |15j . When a kernel requires too much on-chip memory, occupancy 
is limited and the memory latency can decrease performance. When such a 
kernel is fused with another, occupancy is limited for the whole fused kernel. 
Thus, it is possible that the overall performance is better when the kernel 
which limits occupancy is not fused. 

Another factor that can limit occupancy is the storage of additional in- 
termediate data in on-chip memory. Consider the example mentioned above. 
In the fused kernel, fi and f 2 have to be performed before / 3 in any ordering. 
Suppose that fx is performed before / 2 . In this case, when f 2 is performed, 
the result of f\ must be held in on-chip memory, thus at least for f 2 the con- 
sumption of on-chip memory is higher compared to the unfused version. This 
example is depicted in Figure [T] right, where the execution of every unfused 
kernel would consume less on-chip memory compared to the fusion. 

Finally, the optimal number of threads processing data elements can vary 
for different kernels. When such kernels are fused, some of them have to use 
a suboptimal number of threads, or some threads idle in part of the compu- 
tation (but hold on-chip resources), thus fusion may decrease performance. 

As we have shown, kernel fusion may increase as well as decrease the 
performance. The number of possible fusions and their combinations is large 
(see Table [4] or pQ) and a manual search for the best-performing one is time- 
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Figure 1: Computation of z = h{h{x), f 2 (y)) as x' = fx{x), y' = f 2 (y), z = f 3 (x',y'). 
Left: data movement of unfused and fused versions. Right: On-chip memory allocation in 
unfused and fused versions. 

consuming and error prone. Thus, the automatic generation of efficient fused 
code is necessary. 

3.2. Kernel Fusibility 

To fuse two kernels, one has to correctly glue kernel codes into a single 
kernel preserving the original functionality. The automatic fusion of generic 
kernels is difficult for two main reasons. 

• On-chip memory is distributed. Some data, which was originally ex- 
changed via global memory, resides in on-chip memory in fused kernel. 
This data can be transfered via on-chip memory when the following 
holds for all kernels to be fused: (i) kernels use the same mapping of 
threads to exchanged data placed in registers and (ii) kernels use the 
same mapping of thread blocks to exchanged data placed in shared 
memory. Thus, the automatic analysis of this mapping and its modifi- 
cation is needed. 

• Global barrier is not available inside kernel. Kernel execution creates 
a global barrier, which cannot be generally implemented within a ker- 
nel. Two or more kernels can be fused only if this global barrier is 
not necessary, i. e. it can be replaced by a local barrier or avoided en- 
tirely. Thus, it is needed to automatically determine whereas the global 
barrier can be avoided. 
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In our paper, we have restricted the types of kernels to map and reduce 
and their nested combinations (mapped map, or mapped reduce - a map 
function cannot be used as a reduction operator). These kernels have a wide 
range of applications as map and reduce have sufficient expressive power for 
many computing tasks. Also, map and reduce allow automatic fusion, as is 
shown below. We note that the method of fusion is general, therefore more 
types of kernels could be fused automatically, although we currently do not 
support them. 

The idea of fusing GPU kernels performing map and reduce has been 
presented in several papers [SJ El EH ES] • However, the map and reduce can 
execute parallel first-order functions in our case, which makes their fusions 
more complicated. Thus, we discuss in more details how to fuse them. 

3.2.1. Map Kernels 

Let Lj = [e\,e l 2 , . . . , e l m ] is a list of m elements e\, . . . , e l m . The map is 
a higher-order function which applies a given n-arj|^] function / element- wise 
to all elements of the lists Li, . . . , L n , producing the list of results: 

map(f, Li, . . . , L n ) = [f(e\, . . . , e?), . . . , /(e^, . . . , e™ )] 

Suppose two data-dependent calls of map function map(g,map(f, L)), 
L = [ei, . . . ,e n ]. The mapped functions f,g can be fused, i.e. kernel per- 
forming map(g o f,L) can be created, if and only if Wi 6 [l,n], /(e») and 
g(f(ei)) run in the same (single) thread block. It guarantees that the result 
of / can be transfered to g via on-chip shared memory, as the shared memory 
is visible for all threads within the same block. It also guarantees that no 
global barrier is needed, as no data are exchanged between blocks. We note 
that single instance of each mapped function has to fit into thread block, i. e. 
has to use reasonable number of resources (threads, on-chip memory). 

3.2.2. Reduce Kernels 

Let © be a binary associative operator. The reduce is higher-order func- 
tion applying given operator © recursively to all elements of the list L build- 
ing a resulting element. 

reduce(®, L) = e x © e 2 © e 3 © ■ ■ ■ © e n 



2 Some languages uses map only for unary functions and introduce zipwith for n-ary 
functions. 
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The result of reduction is constructed using all elements e\ . . . e n . As 
multiple thread blocks are used to utilize GPU, a global barrier is needed to 
obtain the result of reduction. The important consequence of global barrier 
need is that the result of the reduction cannot be used in the same kernel 
where the reduction is computed. Nevertheless, we can fuse a reduction 
kernel with other kernel. Because of © associativity, a partial reduction can 
be computed locally per thread block without global barrier and thus reuse 
on-chip data (produced by map, or shared input of another reduce function). 
The final result of reduction is obtained after global barrier by reducing 
results of all partial reductions. 

We note that the final result of the reduction can be computed by several 
ways (i) by extra kernel, (ii) by the last running block of kernel performing 
partial reduction (global barrier is replaced by test of termination of all other 
blocks) or (iii) automatically after kernel is finished when atomic instructions 
are available. 

3.2.3. Local Barriers and Registers 

Let / and g are functions being fused. The thread-to-data mapping of 
/, g is same if and only if each word transfered from / to g is stored in / 
and loaded in g by the same thread. As our mapped functions or reduce 
operators can be parallel, the thread-to-data mapping can differ in kernels 
being fused. In that case, data has to be transfered via shared memory and 
local barrier needs to be performed between kernel codes. 

The local barrier is not needed between / and g when the thread-to-data 
mapping is same. When all functions accessing data element e access them 
with same thread-to-data mapping, and the access is not data-dependentQ 
the element e can be stored in registers. 

The Figure [2] illustrates an example of kernels fusion, showing fused kernel 
performing map(g o f,L). In this example, two instances g(f(e2i-i)) and 
g(f(^2i)) run in a single thread block. As no instance is divided among 
thread blocks, data can be passed via on-chip memory between functions / 
and g. Each function is performed by a different number of threads, and let 
all threads of / write a result in this example, thus data exchanged between 
them cannot be placed in registers and thus must be placed in shared memory. 



3 Data element can be placed in registers only if their indexing can be determined in 
compile time |15j . 



9 



thread block 



thread block 1 



local 
barrier 



/(ei) 


/(<*) 


iiiiiii 


iiiiiii 


ff(ei) 
iiiii 


iiiii 



/(<*) 


/(e 4 ) 


iiiiiii 


iiiiiii 


3(es) 
iiiii 


iiiii 



Figure 2: Fused kernel performing map(g o /, L). 



Finally, a local barrier has to be used. 

We does not consider a fusion of functions with different nesting depth, 
as it yields redundant execution of functions with lower nesting depth. We 
note that all rules discussed in this chapter are same for nested and unnested 
map and reduce. 

3.3. BLAS Functions Expressed as Map and Reduce Calls 

In this paper, we use several BLAS functions or sequences of BLAS func- 
tion calls as demonstration of described mechanisms used in our compiler. 
Thus, we first describe how to express BLAS functions to be usable with our 
compiler. 

Many BLAS functions can be expressed as a map, a reduce or a combi- 
nation of the two. Thus, it is possible to fuse them automatically, i. e. to 
perform several BLAS functions as well as their fragments within a single 
kernel. 

BLAS-1 implements vector-vector operations. Each vector can be ex- 
pressed as a list of elements (scalars, or small sub- vectors) . BLAS-1 oper- 
ations process vector elements independently (i.e. perform map), perform 
reductions, or a combination of the two. For example, DOT kernel (z <- x T y) 
multiplies each element from vector x by corresponding element of vector y 
(map) and sums the results of multiplication over all elements (reduce). Cur- 
rently, we have not implemented all BLAS-1 operations, however, we are not 
aware of any BLAS-1 function that cannot be implemented in the described 
model. 

A more complicated situation arises when BLAS-2 functions, which im- 
plement matrix-vector operations, are considered. The matrix can be seen 
as a list of vectors, where vectors represent rows or columns. Then, we could 
implement BLAS-2 functions as a mapping of scalar-vector functions (where 
scalars are elements of vectors and vectors are elements of matrices). How- 



10 



ever, vectors representing matrix columns or rows may be too large to be 
stored in on-chip memory, which makes fusions impossible. To overcome 
this limitation, we introduce nested calls of map and reduce, which allows 
us to further divide vectors representing matrix rows or columns to smaller 
elements and process these elements with map or reduce. 

Consider the matrix-vector multiplication y = Ax as an example of 
a BLAS-2 function. In its computation, for the z-th row of A, the dot product 
of the row and x is computed and stored to the z-th position in y. 

Without nesting, we represent y = Ax as: 



where A is a list of rows A±, A2, . . . , A ny and Ob IS cL list of elements of the 
vector x. The single instance of function dotprod computes dot product of 
row A4 and vector x, which may be too large for on-chip memory. 
Using nested map and reduce, y = Ax can be expressed as: 



where A is a list of rows A\, A2, ■ ■ ■ , A n , each row A, is a list of elements in 
i — th row and a; is a list of elements of the vector x. 

We note that we can view dimensions of multi-dimensional structures in 
any order - e. g. matrix can be viewed as a list of rows or list of columns. 
Thus, we can similarly express operations for transposed matrices. 

The current BLAS-2 standard cannot be fully implemented using our 
model based on map, reduce and their nested combination. For example, 
we cannot handle symmetric matrices stored in some packed format in this 
model. To overcome this limitation, more general data structures have to be 
supported by our compiler in the future. 

4. The Compilation Process 

Based on observations given in Section [3j we have developed a source-to- 
source compiler, which is able to optimize sequence of kernel calls by kernel 
fusion. In this section, we focus on the process of creating fusions and fusion 
code generation and briefly describe the process of fusion space exploration, 
which is discussed in more detail in our previous paper [14] . 




(1) 



y = map(reduce(+, map(-, Ai, x)), A) 



(2) 
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4-1. Compilation Stages 

Our compiler works with a special form of kernel implementation con- 
taining CUDA code implementing some higher-order function applying some 
first-order function on many elements — we call this special form of a kernel 
elementary function. The main purpose of the compiler is to transform a se- 
quence of elementary function calls into the sequence of kernel calls, where 
single kernel can include one or more elementary functions, maximizing per- 
formance of output code. 

Recall that the input of our compiler consists of a high-level script and 
a library of elementary functions. Each elementary function can be present in 
several alternative implementations in the library with different performance 
characteristics. The script calls functions from the library, thus it defines the 
sequence of function calls and data dependencies. 

The compilation process is divided into three main stages: 

• parsing the script and library (reading elementary functions and their 

metadata); 

• generation and search of the optimization space; 

• code generation. 

The script and metadata parsing is straightforward and is not discussed 
here. The optimization space exploration and code generation are discussed 
in the following sections in more detail. 

4-2. Generation and Search the Optimization Space 

The input script is parsed creating data dependency graph, where ver- 
tices represent elementary function calls and edges represents data depen- 
dency between functions. Having data dependency graph build and library 
data parsed, the code without fusions can be generated (i. e. each elemen- 
tary function is translated to separated kernel). However, there is usually 
a large number of possible codes with fusions. Thus, the optimization space 
is generated and searched for the code with the best expected performance. 
There are three main steps in the generation of the optimization space. 

• Generation of fusions. We define fusion as a fusible subgraph of data 
dependency graph (selection of elementary functions, which can be 
safely fused without influencing input program semantics). At this 
step, a space of all reasonable fusions is generated. 
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• Generation of fusion implementations. Each fusion can be implemented 
in many different ways, differing in (i) calling order of functions (which 
can affect the amount of needed on-chip memory), (ii) chosen imple- 
mentations of elementary functions, (iii) block size or (iv) number of 
serial iterations. At this step, implementations of each fusion are gen- 
erated and their performances are predicted. 

• Generation of combination of fusion implementations. The combina- 
tion of fusion implementations is such a selection of fusion implementa- 
tions and unfused kernels, that covers all calls of elementary functions 
defined in input script and maximizes predicted performance. The 
combination of fusion implementations can be transformed to output 
CUDA code covering the whole computation given by the script. The 
generation of combinations can be repeated many times (omitting pre- 
viously selected combinations) to allow empirical search for output code 
with the best performance. 

During the generation of the optimization space, some implementations 
are automatically pruned, e. g. fusions which does not spare memory transfers 
or fusion implementations which use larger amount of on-chip memory per 
instance than another implementation of same fusion. After the pruned opti- 
mization space is generated, the performance of each fusion implementation 
is predicted. The fusion implementations with poor predicted performance 
are not definitely pruned — when sufficient number of combinations of fusion 
implementations is generated, they are used in some combination. 

The performance prediction works as follows. The basic idea of our per- 
formance prediction method is to sum previously benchmarked running times 
of routines according to the fusion implementation for which the performance 
is being predicted. More precisely, the running time of each routine is mea- 
sured in simulated fusion environment - certain ranges of the number of 
instances per block, sequential iterations and additionally allocated shared 
memory (which simulates additional data used within the fusion). When 
the runtime of some fusion F is predicted, the runtimes of routines used in 
F, matching the fusion environment of F, are summed. The time of data 
transfers (i. e. load and store routines) t t and computation (i. e. compute 
routines) t c are summed separately and the predicted runtime is computed 
as max(t t ,t c ). Thus, we assume full overlap of computation and data trans- 
fers. This is inaccurate when occupancy is low and the overlapping ability is 
reduced. However, in that case the timing of each routine is also poor, thus 



13 



fusions with low occupancy should not be favored even when full overlap is 
expected. 

Note that the benchmarking of routines is performed once per routine 
per GPU architecture and not at the time of compilation. The demand of 
benchmarking new routines is compensated for by the low sensitivity of our 
compiler to GPU architecture changes - as our performance prediction is 
based on empirical measurements, the new GPU architecture needs only re- 
benchmarking of routines rather than a re-design or re-parametrization of 
the prediction method. 

4-3. Creating Kernels from Elementary Functions 

Recall, that our compiler is not able to fuse generic kernels implemented in 
C for CUDA, but works with elementary functions. In fact, elementary func- 
tion used by our compiler contains nearly complete code of unfused kernel— 
however, it must fulfill several requirements described below. 

The elementary function is implemented to perform some higher-order 
function applying some first-order function on many elements (in current 
implementation, higher-order function can be map, reduce or nested combi- 
nation). The single instance of elementary function performs the first-order 
function to some input elements generating output element, i.e. when ele- 
mentary function performs map(f,L), single instance performs /(e,), when 
elementary function performs reduce(Q), L), single instance performs e2i-i © 
e2i- 

To be usable with our compiler, the elementary function is associated 
with metadata, which defines its properties, such as parallelism requirements 
or implemented higher-order function. Each elementary function has to be 
implemented in several routines (functions called from CUDA code): 

• load (separate for each input type), loads input data stored in global 
memory into on-chip memory; 

• compute performs computation on data in on-chip memory; 

• store stores data from on-chip memory into global memory. 

The decomposition of elementary function into routines is the core prin- 
ciple which significantly simplifies the code generation. The kernel is created 
as a sequence of load, compute and store routine calls. When some functions 



14 



are fused, the stores and loads for elements that remain in on-chip mem- 
ory are not called and the remaining calls are glued into single kernel (see 
Figure M for illustration of a simple fusion). 



/ load \ 
compute 
compute 

\ store j 



Figure 3: Illustration of a simple fusion. 

The compiler generates routine code, kernels calling these routines and a 
code encapsulating kernels allowing to empirically search for the most efficient 
one. 

4-3.1. Routine Code Generation 

First, the compiler generates routines: it copies their code from the li- 
brary, assigns values to macros and modifies local memory addressing, when 
registers are used to store input or output elements. Macros in routines have 
prescribed names and are used for the thread block size and number of iter- 
ations. They are used to allow the CUDA compiler to evaluate expressions 
which use them at compile time or unroll small loops. 

4-3.2. Main Kernel Structure 

When the kernel code is created from elementary functions, the compiler 
knowis the type of the higher-order function which is implemented by the el- 
ementary function (the type of higher-order function is defined in metadata). 
It allows the compiler to (i) generate the computation of thread and block 
indices and configure the grid size, (ii) force a global barrier before the result 
of the reduction is usecQ and (iii) correctly place loads of invariant variables 
and stores of accumulable variables (i. e. variables that can be accumulated 
outside of the cycle performing sequential iterations). 



4 This is trivially fulfilled in code generation stage, as outputs of all reductions are used 
outside of the fusion implementation performing the reduction, thus the global barrier is 
performed by finishing the kernel. 



/ load \ 



compute 
\ store ~7 

/ load \ 



compute 
\ store ~7 
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Algorithm 1 Schema of kernel 
1: allocate variables in shared memory 
2: create arrays in registers 
3: compute thread and block indices 
4: load invariant data 
5: clear outputs of accumulated reductions 
6: for i = 0; i < iters; i + + do 

7: call non-invariant load, compute and store routines 
8: recompute block indices 
9: end for 

10: call stores of accumulated reductions 



The unnested function runs in a ID grid. When more than one serial 
iteration is performed, the grid is adequately shrunk and block indices are 
recomputed in each iteration, simulating the execution of the full-sized grid. 
For the nested functions (recall that only nesting level 2 is supported in the 
current implementation), a 2D grid is used, and iterations shrink the grid 
in one dimension, working similarly to unnested functions. In the following 
paragraphs, we show structure of the generated code. 

Algorithm [I] sketches the basic structure of the generated (fused or un- 
fused) kernel. All data exchanged between routines via shared memory are 
allocated at the beginning of the kernel (line 1). Elements in shared mem- 
ory can overlap when possible to spare shared memory usage [H]. This is 
technically realized by allocating one large array and creating pointers into 
this array, representing data elements. For data elements stored in registers, 
local arrays are defined (line 2). The size of local arrays is set to the size of 
one element regardless of the fraction of element used by one thread pQ. 

The thread and block indices are set at line 3 according to real thread 
and block indices for the kernel. When some routine within the kernel needs 
different parallelism, indices are recomputed before this routine is called. 

For nested map or reduce, some input data elements can be invariant 
across iterations (e. g. for matrix-vector multiplication, a sub-vector can 
multiply several matrix tiles), thus invariant loads are called before the loop 
(line 4). Both nested and unnested variants of reduce can accumulate their 
result across iterations, thus their results are cleared before the loop (line 
5) and stored after the loop finishes (line 10). The rest of the routines are 
called within the loop (line 7) according to selected calling order and the 
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Algorithm 2 Schema of routine call 
1: call local barrier 
2: clear output of the reduction 
3: if thread participates then 
4: recompute thread indices 
5: call routine 
6: end if 



block indices are recomputed at the end of each iteration (line 8). 

Note that fusion of nested and unnested functions are not efficient, as it 
results in repetition of unnested operations and hence does not spare global 
memory transfers. Therefore, our compiler does not fuse functions with dif- 
ferent nesting level. Consequently, compute routines are always performed 
within the loop, as no result of a compute routine performed within the fusion 
is invariant across loop iterations. 

4-3.3. Generation of Routine Call 

In Algorithm [TJ routines are called at lines 4, 7 and 10. The more detailed 
schema of a generated routine call is described in Algorithm |2j First, the 
local barrier call can be generated. The local barrier before routine r is 
generated, if one of the following conditions holds. 

• Routine r accesses at least one input element e, that has been modified 
by routine s, and (i) thread-to-data mapping of access to e is different 
for r and s and (ii) no local barrier is called between r and s. 

• Routine r writes the element e into shared memory, and e overlaps with 
another element e', that is accessed after last synchronization called 
before r. 

The first condition ensures that all words of the element e are written into 
shared memory before they are read by r, when thread-to-data mapping is 
different in writing and reading the element. The second condition provides 
synchronization of all warps before element e' is rewritten by e to ensure that 
all routines accessing e' are finished before its rewriting. 

When the routine performing reduction is to be called and its output is 
not accumulated among iterations, the code clearing its output is generated 
at line 2. 
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If the routine is performed by a lower number of threads than it is available 
within the kernel, line 3 reducing the parallelism is generated. The code 
reducing parallelism is created to keep maximum of warps fully utilized, i. e. 
when parallelism is reduced from m to n threads, threads < 0, . . . , n — 1 > 
continue in computation whereas threads < n, m — 1 > stall. 

The thread indices recomputation (line 4) is generated when parallelism 
is reduced, or when the thread arrangement of the routine differs from the 
thread arrangement of the fusion (e. g. a routine need a block of 9 x n x 1 
threads for n instances, whereas the fused kernel uses block of 3 x 3 x n 
threads, i.e. the same number of threads, but different indexing). The 
compiler generates the indexing computation that maps adjacent indices to 
adjacent threads to create at most one under-populated warp. Moreover, 
as it knows the number of threads in each dimension required by routines 
in compile time, it optimizes the number of arithmetic operations needed 
to recompute indices. After the parallelism is restricted and indices are 
recomputed, the routine can be called (line 5) with new indices. 

4-4- An Example of Code Generation 

To demonstrate the compiler's features described above, we have chosen 
the computation of BiCGK sequence as an example. The sequence performs 

q = Ap 
s = A T r 

It demonstrates kernel fusion (g = Ap and s = A T r are implemented as 
separated elementary functions) as well as working with nested operation. 

Recall that the vector and matrix elements can be represented by a single 
number, or some larger structure. We are using a 32-number sub-vector as 
a vector element and 32 x 32 tile as a matrix element. These element sizes 
allow us to write efficient elementary functions, such as q = Ap or s = A T r, 
where single instance multiplies 32 x 32 matrix tile by sub- vector of size 32, 
giving good opportunity for manual optimizations. It implies that the size 
of A (and consequently all vectors) must be padded to 32 in each dimension. 

TILE32x32 A; 
subvector32 p, q, r, s; 

input A, p, r; 

q = sgemv (A, p) ; 
s = sgemtv (A, r ) ; 
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8 

9 return q, s; 

Listing 1: Script performing BiCGK sequence 

We have implemented elementary functions sgemv (q = Ap) and sgemtv 
(s = A T r). The sctipt performing BiCGK sequence is listed in Listing HI 



1 device void d_sgemv_l_load_l (TILE32x32 A, TILE32x32 s_A, 

2 int tx, int ty, int bx, int by, int sx) { 

3 #pragma unroll 

4 for (int j = 0; j < 32; j+=SGEMV_l_BY) 

5 s_A[ty*33+tx+j*33] = A [ (by*32+ty+j ) *sx*32 + bx*32+tx] ; 

6 } 
7 

8 device void d_sgemv_l_load_2 { sub vector 32 x, subvector32 s_x, 

9 int tx, int ty, int bx, int by) { 

10 if (ty == 0) 

11 s_x[tx] = x [bx*32+tx] ; 

12 } 
13 

14 device void d_sgemv_l_compute (TILE32x32 s_A, subvector32 s_x, subvector32 s_y, 

15 int tx, int ty) { 

16 float tmp = O.Of; 

17 #pragma unroll 

18 for (int j = 0; j < 32; j+=SGEMV_l_BY) 

19 tmp += s_A [tx*33+ty + j ] *s_x[ty+j] ; 

20 atomicAdd (s_y+tx, tmp); 

21 } 
22 

23 device void d_sgemv_l_save (subvector32 s_y, subvector32 y, 

24 int tx, int ty, int bx, int by) { 

25 if (ty == 0) 

26 atomicAdd (y+by*32+tx, s_y[tx]); 

27 } 

Listing 2: Routines performing q = Ap 



The CUDA code of all routines of elementary function sgemv is listed 
in Listing [2] There are two load routines (one for matrix tile, one for sub- 
vector), one store routine (saving sub- vector resulting from the reduction) 
and one compute routine, multiplying matrix tile with sub- vector. The macro 
SGEMV_1_BY is expanded to the selected y-size of the block and the function 
is implemented to run only in single instance per block (as there is enough 
parallelism, the execution of multiple instances per block is not necessary, 
contrary to unnested functions). As it can be seen, the code of sgemv is 
quite low-level, but still reasonably simple. 

The metadata are associated with CUDA code of sgemv, determining 
the parallelism required by single instance of the function, higher-order func- 
tion and data padding. Optionally, access pattern defining thread-to-data 
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mapping can be denned |TJ . The demand for writing these function's proper- 
ties into metadata brings no significant programming overhead, as they have 
to be known to the programmer implementing the function. 

Algorithm 3 Fused q = Ap, s = A T r 
1: allocate A h p h q h r h si in shared memory 
2: compute thread indices 

3: compute tile indices x block. x, y <— i • block.y 
4: pi <— load(p, x) 

5: Si <- 

6: for y' = y;y' < min(n, y + i); do 
7: ri load(r, y') 
8: A\ <r- load(A,x,y') 
9: si <— compute^gemtv(Ai,ri,x,y') 
10: qi <- 

11: qi 4— compute_gemv(Ai,pi,x,y') 

12: q store(qi,y') 

13: y' ^ y' + 1 

14: end for 

15: s ^— store(si, x) 



The pseudo-code of the generated fused kernel of BiCGK sequence is listed 
in Algorithm [3| The algorithm has several inputs: A is an m x n matrix, 
p, s are vectors of size m, and q, r are vectors of size n. Load, compute and 
store routines, which are called in the generated code, are present in the 
library of elementary functions. In the optimization space searching phase, 
the compiler has decided to perform several serial iterations in each instance, 
thus, the for loop going over several matrix tiles is to be generated in the 
kernel. 

The code generation works as follows. The compiler generates a shared 
memory allocation for all on-chip variables. Each variable in shared memory 
can be padded — in this example, A is allocated as array of size 33 x 32 to 
allow conflict-free parallel access to columns. After memory allocation, the 
compiler generates the computation of the thread indices and block indices 
x, y, where y is stridden according to number of serial iterations i. When 
indices are computed, the routines can be called. The local part of vector 
p loaded into p\ is invariant across iterations, and the output of the partial 
reduction si can be accumulated across iterations. Thus, the compiler puts 
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loading of pi and zeroing of si before the loop. Within the for loop, the local 
part of r and A are loaded to r\ and Ai, and Afri is computed and added 
to si (line 9). To compute Aipi (line 11), p\ is zeroed (line 10) and stored 
(line 12) after the computation in each iteration. When all the iterations are 
finished, accumulated result in si is stored. We note that for simplicity, local 
synchronizations are not shown in the pseudo code. The complete generated 
routines and kernel code in C for CUDA of fused BiCGK kernel are listed 
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Figure 4: A data usage of a single instance of the fused BiCGK. 

The data movement in the computation is illustrated in Figure |4j The 
single instance of BiCGK processes two tiles of A in depicted example (i — 2), 
thus two sub- vectors of vectors r, q, and one sub- vector from both p and s 
are moved between global and on-chip memory. The instances are created 
m x | times over the matrix. 

The important property of the algorithm described above is that Ap can 
be fused together with A T r, although the dot products of the multiplied 
vectors and matrix A are performed using rows as well as columns of the 
matrix — the only difference is in the placing of routines call with respect to 
the loop (for invariants or accumulable output). 



5. Evaluation 

In this section, the optimization of sequences of BLAS calls is evaluated. 
First, various sequences of linear algebra kernels used in our experiments are 
defined and the possibilities for optimizing them are analyzed. Second, the 



21 



performance of implementations generated by our compiler is evaluated and 
compared with implementations using CUBLAS. Finally, the accuracy of the 
performance prediction method and compiler timing is analyzed. 

5.1. Experiment setup 

To test the code efficiency of our compiler, we have used the same se- 
quences as in [2] , which are specified in Table [I] These sequences are a rep- 
resentative selection of generally interesting operations, where many of them 
have important applications (BiCGK is used in biconjugate gradient method, 
ATAX in normal equations computation), are added to the updated BLAS 
specification (AXPYDOT, SGEMVT, GEMVER, GESUMMV, WAXPBY) 
|17j . are in original BLAS specification (SGEMV and SSCAL) or are gener- 
ally usable (MADD, VADD). Some of these sequences can be significantly im- 
proved by fusions whereas others cannot. The adoption of sequences from [2] 
allows us to compare effect of fusion on two different processors — multi-core 
CPU and many-core GPU. The only difference between our sequences and 
those used in [2] is in the floating point accuracy — we have used the single 
precision version of all sequences, whereas in [2] double precision has been 
usec0 

We have assigned tags to each sequence in Table [TJ These tags indicate 
optimizations that our compiler is able to perform. Tag F indicates that 
fusion can be used to improve performance, tag S indicates that more spe- 
cialized kernels that save some work compared to CUBLAS can be generated. 
Finally, tag B indicates sequences that have their equivalents in CUBLAS, 
thus any optimization that can be used by our compiler can also be imple- 
mented in CUBLAS. When some tag is enclosed in brackets, its significance 
is low, i. e. is related to BLAS-1 operations in sequences where much more 
time-consuming BLAS-2 operations are executed. 

For some sequences, the tag assignment does not have to be straightfor- 
ward, thus we discuss it in more detail. 

• ATAX and SGEMVT cannot be improved by fusion. In both cases, 



5 Note that the selection of different precision should not affect comparison of speedups 
reached by our compiler and BTO BLAS. Although double amount of data are transfered 
when double precision is used, the CPU SSE peak performance in double precision is a half 
of single precision performance, thus the ratio of memory to arithmetic throughput is same 
for both single and double precisions and therefore the effect of fusions should be same. 
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Sequence 


Operation 


Tag 


AXPYDOT 


z <- 


- w — av 




FS 




r f- 


T 

- z u 






ATAX 


v <- 


- A T Ax 






BiCGK 




- Ap 




F 




s f- 


- A T r 






SGEMV 


z f- 


- aAx + /3y 




B 


SGEMVT 


x <- 


- (3A T y + z 




(S) 




W i 


— aAx 






SSCAL 


X <r 


- ax 




B 


GEMVER 


B < 


- A + uivj + U2V2 


FS 




x <- 


- (3B T y + z 








W i 


— aBx 






GESUMMV 


y <- 


- aAx + (3Bx 




(F) 


MADD 


c< 


- A + B 




S 


VADD 


X <r 


- w + y + z 




FS 


WAXPBY 


W i 


- ax + f3y 




F 



Table 1: Sequences used in our performance study, adopted from [2]. Tags: F=improvable 
by the fusion, S=improvable by kernel specialization, B=equivalent of CUBLAS kernel. 

matrix A is used twice, but a global barrier is needed between uses of 
A, and thus must be used in separate kernels. 

• GESUMMV can spare the reading of vector x when it is performed in 
a single kernel. However, because of reading the matrices A and B, the 
amount of data transfer is almost the same in the fused and unfused 
versions. 

• All sequences with the S tag require memory copy or cleaning in the 
CUBLAS implementation because of the in-place implementation of 
some CUBLAS kernels, whereas kernels generated by our compiler can 
work out-of-place. 

5.2. Performance Results 

All experiments have been performed on a machine equipped with an 
Intel Core2 Q9550 (2.83 GHz), 8 GB RAM, and a GeForce GTX 480. Ubuntu 
10.04 with CUDA Runtime 5.0 and Driver 304.54 have been installed. 

Table [2] compares performance of code generated by our compiler with 
CUBLAS implementations. 
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Sequence 


Our compiler 


CUBLAS 


Speedup 


Tag 


AXPYDOT 


38.3 GFlops 


19.7 GFlops 


1.94x 


FS 


ATAX 


73.5GFlops 


71.5 GFlops 


1.03x 




BiCGK 


115 GFlops 


71.5 GFlops 


1.61x 


F 


SGEMV 


73.3 GFlops 


69.9 GFlops 


1.05x 


B 


SGEMVT 


73.3 GFlops 


71.5 GFlops 


1.03x 


(S) 


SSCAL 


18.2 GFlops 


17.3 GFlops 


1.05x 


B 


GEMVER 


83.4 GFlops 


31.9 GFlops 


2.61x 


FS 


GESUMMV 


73.4 GFlops 


73.1 GFlops 


lx 


(F) 


MADD 


11.3 GFlops 


7.68 GFlops 


1.47x 


S 


VADD 


20.0 GFlops 


8.84 GFlops 


2.26x 


FS 


WAXPBY 


36.4 GFlops 


18.9 GFlops 


1.93x 


F 



Table 2: Performance comparison of generated and CUBLAS implementations of studied 
sequences. 

In all cases, the generated implementation performs better or similarly 
compared to CUBLAS. Significant speedup is obtained in the case of se- 
quences where the fusion can improve memory locality (tag F) as well as 
when kernel specialization is possible (tag S). Those sequences demonstrate 
the strength of our compiler, as they are improved by compiler's optimiza- 
tions which cannot be implemented in CUBLAS (without modification of its 
API). 

We have analyzed the scaling of our generated implementations. Com- 
pared to CUBLAS, the performance is generally better for small problems, 
which is probably caused by the smaller thread blocks that our code uses and 
better latency hiding when fused kernels are used. The smoothness of per- 
formance for various data sizes is similar to that of CUBLAS. The scaling of 
BiCGK and GEMVER sequences is depicted in Figures [5] and [6j respectively 
Besides these examples, other sequences used in this study scale similarly. 

To the best of our knowledge, there is no other system allowing fusion of 
BLAS functions for GPUs that could be compared with our results. Nev- 
ertheless, we can compare the relative speedup of our generated codes with 
relative speedup of CPU code generated by BTO BLAS [2] , see Table [3j 

Our speedup is generally better comparing to the speedup of BTO BLAS 
when fusion can be used (sequences AXPYDOT, BiCGK, GEMVER, VADD, 
WAXPBY). Our compiler is more successful with sequences equivalent to 
BLAS functions (SGEMV) or sequences with reduced opportunity to be im- 
proved by fusion (GESUMMV) — in our case, the performance is comparable, 
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matrix size 

Figure 5: Performance of BiCGK sequence. 

whereas BTO BLAS generates slower codes. The main reason is probably 
that our compiler fuses hand-written routines, that can be adequately tuned. 
The exactly same speedup is shown in the case of MADD, where only kernel 
specialization takes place. 

On the other hand, BTO BLAS has a wider opportunity to enhance 
code performance using fusions. When the function / performs reduction on 
each row of the matrix and the reduction's result is an input of function g 
processing the same row, CPU is able to hold the row in the cache and reuse it 
after reduction finish (thus outer loops in / and g going over rows are fused, 
whereas inner loops are unfused). Considering GPU, the row needs to be 
partitioned among more thread blocks when is read into on-chip memory by 
/, thus thread blocks needs to be synchronized before the result of reduction 
is available. Our compiler performs the synchronization by the new kernel 
invocation, thus all on-chip data are lost before the result of the reduction is 
available for g so no row data can be reused. The only option how to reuse 
row data on GPU is to use persistent threads [18], but it is not clear if it 
could have a positive performance impact, as the inter-block synchronization 
is possible, but decreases the performance. The wider fusion opportunity of 
BTO BLAS caused better speedup in sequences ATAX and SGEMVT. 

As sequences analyzed in this section are memory-bounded even after fu- 
sion, their arithmetic throughput is far from the peak throughput of GeForce 
GTX 480. To determine generated kernels efficiency, we have measured their 
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Figure 6: Performance of GEMVER sequence. 



bandwidth (shown in last column of Table [3]). Note that the bandwidth of 
fused kernels is measured (i. e. only bandwidth of data that are really trans- 
fered from or to global memory), which gives us the information about real 
efficiency of kernel implementations hiding improvements of the fusion. The 
maximal theoretical bandwidth of GeForce GTX480 is 177.4 GB/s. However, 
this bandwidth is unreachable in practice. The most of our kernels reaches 
more than 75 % of theoretical peak, which can be considered as a very good 
result (there is no significant chance to improve their performance by tuning 
the compiler). The BiCGK kernel reaches about 65 % of the peak bandwidth. 
We have experimented manually with tuning this kernel, reaching 78 % of the 
peak bandwidth when loops of load and compute routines iterating over ma- 
trix tile are manually fused, showing further challenges in automatic kernel 
fusion. 



5.3. Performance Prediction Accuracy 

We have analyzed the accuracy of the performance prediction. As numer- 
ous possible implementations can be generated, good performance prediction 
allows the reduction of the empirical search to several promising candidates 
or eliminate empirical searching entirely. 

Table [4] shows the number of possible implementations of each sequence 
and the rank of the fastest implementation. As we can see, the best im- 
plementation is not generated as the first one in six cases. However, the 
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Senuence 


Our 


BTO BLAS 


Our meniory 




SDPPflnn 


snppdun 


bandwidth 


AXPYDOT 


1.94x 


1.58X 


153.2 GB/s 


ATAX 


1.03x 


1.37x 


147GB/s 


BiCGK 


1.61x 


1.5x 


115 GB/s 


SGEMV 


1.05x 


0.83x 


146.6 GB/s 


SGEMVT 


1.03x 


1.29x 


146.6 GB/s 


SSCAL 


1.05x 


n/a 


145.6 GB/s 


GEMVER 


2.61x 


2.37x 


143 GB/s 


GESUMMV 


lx 


0.93x 


146.8 GB/s 


MADD 


1.47x 


1.47x 


135.6 GB/s 


VADD 


2.26x 


1.83x 


160 GB/s 


WAXPBY 


1.93x 


1.88x 


145.6 GB/s 



Table 3: Comparison of the speedup of sequences generated by our compiler and best 
cases generated by BTO BLAS and the memory bandwidth of our kernels. 



Sequpncp 


Impl. 


Best impl. 


First impl. 


Worst impl. 


name 


count 


found 


performance 


performance 


AXPYDOT 


25 


4th 


75.2 % 


34.9 % 


ATAX 


1 


1st 


100% 


n/a 


BiCGK 


5 


1st 


100% 


64.0 % 


SGEMV 


83 


14th 


99.2 % 


97.8 % 


SGEMVT 


41 


5th 


99.8 % 


99.4 % 


SSCAL 


1 


1st 


100% 


n/a 


GEMVER 


1271 


54th 


98.7 % 


43.1% 


GESUMMV 


415 


51st 


99.6 % 


94.4 % 


MADD 


1 


1st 


100% 


n/a 


VADD 


41 


14th 


94.6 % 


50.4 % 


WAXPBY 


83 


1st 


100% 


29.3 % 



Table 4: For each studied sequence, the count of all implementations is shown in the 
second column, the rank of the best generated implementation is shown in the third 
column, the performance of the first generated implementation relative to the best one is 
in the fourth column and the performance of the worst implementation relative to the best 
performing implementation is shown in the fifth column. To eliminate measurement error, 
all implementations for which performance does not differ more than 0.1 % are considered 
to have same performance. 
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Sequence 


First 


All 


Empirical 




implementation 


implementations 


search 


AXPYDOT 


0.144s 


0.241s 


1 m 59 s 


ATAX 


0.137s 


0.144s 


5s 


BiCGK 


0.140 s 


0.164s 


18s 


SGEMV 


0.152s 


0.900 s 


8m22s 


SGEMVT 


0.123s 


0.393 s 


4m 42 s 


SSCAL 


0.139s 


0.113s 


3s 


GEMVER 


0.133s 


42.165 s 


3h24m36s 


GESUMMV 


0.123s 


5.707 s 


48 m 23 s 


MADD 


0.128s 


0.116s 


4s 


VADD 


0.133s 


0.248 s 


3m3s 


WAXPBY 


0.156 s 


0.731s 


7ml4s 



Table 5: Time of compilation and empirical search for tested sequences. 



performance of the first generated implementation is reasonably close to the 
best one except the AXPYDOT routine (see fourth column of the table). 

The selection of inefficient implementation of AXPYDOT is caused by 
a systematic error in the performance prediction method which underesti- 
mates performance of fused kernels. The error is probably caused by ignor- 
ing kernel startup overhead and serial code optimizations. As performances 
of the fused and unfused versions of AXPYDOT are relatively close, the 
compiler wrongly expects unfused version to run faster. 

The last column of Table [4] shows the performance of the worst gener- 
ated implementation compared to the best one. As we can see, the worst 
generated implementations often perform poorly, thus the sorting of possible 
implementations by predicted performance is crucial. 

5.4- Compilation Time 

The compilation time and empirical search time are given in Table |5j As 
we can see, compilation time is rather same when only implementations with 
best predicted performance are generated. When all possible implementa- 
tions (given by combinations of fusion implementations) are generated, the 
compilation time is still feasible: it reaches at most tens of seconds in compi- 
lation of GEMVER sequence (1271 implementations generated), less then 6 
seconds in GESUMMV and less than 1 second all other sequences. However, 
the time for empirical search for best performing implementation increases 
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proportionally with the number of implementations. Fortunately, the em- 
pirical search has small impact on performance and if it is used, only a few 
implementations needs to be generated and benchmarked to have a good 
chance to find the best performing one (see Table [4]). 

6. Conclusions and Future Work 

In this paper, we have significantly extended our approach to automatic 
kernel fusion by introducing fusion of reduce and nested map and reduce 
kernels. We have shown that kernel fusion can improve the performance of 
memory-bound kernels. Although the fusion of general kernels is difficult 
to automate, we have argued that the automation is possible and demon- 
strated the automation for (possibly nested) map and reduce kernels using 
our source-to-source compiler. The application of our compiler has been 
demonstrated by fusing sequences of BLAS calls, where a significant speedup 
comparing to CUBLAS has been observed. 

We plan to focus on generalization of the presented method, allowing 
to fuse more types of kernels, work with irregular data structures or target 
multiple GPUs. 

• Support for more types of fusible kernels. A more general model of 
temporal locality in GPU memory hierarchy could be formulated, which 
would allow us to handle more types of kernels, such as stencils, scatters 
or gathers. Supporting more general kernels significantly extends the 
applicability of our compiler, e. g. in image processing, ODE solvers, 
or Finite Difference Method. 

• Support for irregular data types. The operations working with irregular 
data types, such as triangular or diagonal matrices, or sparse arrays, 
need more general higher-order functions, than are currently supported. 
Irregular or sparse structures enrich application area of our compiler, 
e. g. for large simulation of physical phenomena. 

• Support for multi-GPU computations. To allow scaling of GPU appli- 
cations, the workload needs to be distributed among multiple GPUs. 
While the distribution of map and reduce is quite straightforward, more 
complicated functions, such as nested map and reduce or stencils yield 
significantly more difficult data exchange pattern. 
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Besides improving the compiler, we are going to develop libraries of el- 
ementary functions. We plan to implement more functions from the BLAS 
standard which are fusible by the compiler and a library of linear algebra 
operations on small elements usable for element subroutines in FEM. 
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Appendix A. Generated Code of BiCGK Sequence 



1 global void cu_sgemv_0_sgemtv_2_computation (TILE32x32 in_0xlf 97c00, 

2 subvector32 in_0xlc88ce0, subvector32 in_0xld7 6560 , 

3 subvector32 out_0xld2c980, subvector32 out_0xlc71da0, 

4 int sx, int sy) 

5 { 

6 int tx = threadldx . x; 

7 int ty = threadldx . y ; 

8 int bx = blockldx.x; 

9 int by = blockldx.y; 
10 

11 shared float s_f usion [ 1152 *1 ] ; 

12 float* s_sgemtv_2_out = s_fusion 4- (32*1); 

13 float* s_sgemv_0_out = s_fusion + (0*1); 

14 float* s_0xlc88ce0 = s_fusion + (1120*1); 

15 float* s_0xlf97c00 = s_fusion + (64*1); 

16 float* s_0xld76560 = s_fusion + (0*1); 
17 

18 //data loading 

19 d_sgemv„l__load_2_bx32_by4_bzl_s (in_0xlc88ce0, s_0xlc88ce0, 

20 tx, ty, bx, by) ; 
21 

22 //clear output of reduction 

23 if (ty == 0) s_sgemtv_2_out [tx] = O.Of; 
24 

25 by = by* 8; 

26 int stop = min(by + 8, sy) ; 

27 for (; by < stop; by++) { 

28 //data loading 

29 d_sgemtv_l_load_l_bx32_by4_bzl_s (in_0xlf 97c00, s_0xlf97c00, 

30 tx, ty, bx, by, sx) ; 

31 //data loading 

32 d_sgemtv_l_load_2_bx32_by4_bzl_s ( in_0xld7 65 60 , s_0xld76560, 

33 tx, ty, bx, by) ; 

34 syncthreads () ; 

35 

36 //computation 

37 d_sgemtv_l_compute_bx32_by4_bzl_sss ( s_0xlf 97c0 , s_0xld76560, 

38 s_sgemtv_2_out, tx, ty) ; 

39 syncthreads () ; 

40 

41 //clear output of reduction 

42 if (ty == 0) s_sgemv_0_out [tx] = O.Of; 

43 syncthreads () ; 

44 

45 / /computation 

46 d_sgemv_l_compute_bx32_by4_bzl_sss ( s_0xlf 97c0 , s_0xlc88ce0, 

47 s_sgemv_0_out , tx, ty) ; 

48 syncthreads () ; 
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49 

50 //data storing 

51 d_sgemv_l_save_bx32_by4_bzl_s ( s_sgemv_0_out , out_0xld2c980, 

52 tx, ty, bx, by) ; 

53 syncthreads ( ) ; 

54 } 
55 

56 //data storing 

57 d_sgemtv_l_save_bx32_by4_bzl_s ( s_sgemtv_2_out , out_0xlc7 IdaO , 

58 tx, ty, bx, by) ; 



59 } 

Listing 3: Generated BiCGK kernel 
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